C-3: Coherence and Coreference Corpus
نویسندگان
چکیده
The phenomenon of coreference, covering entities, their mentions and their properties, is intricately linked to the phenomenon of coherence, covering the structure of rhetorical relations in a discourse. A text corpus that has both phenomena annotated can be used to test hypotheses about their interrelation or to detect other phenomena. We present the process by which C-3, a new corpus, was obtained by annotating the Discourse GraphBank coherence corpus with entity and mention information. The annotation followed a set of ACE guidelines adapted to favour coreference and to include entities of unknown types in the annotation. Together with the corpus we offer a new annotation tool specifically designed to annotate entity and mention information within a simple and functional graphical interface that combines the “best of all worlds” from available annotation tools. The potential usefulness of C-3 is discussed, as well as an application in which the corpus proved to be a valuable resource.
منابع مشابه
Corpus based coreference resolution for Farsi text
"Coreference resolution" or "finding all expressions that refer to the same entity" in a text, is one of the important requirements in natural language processing. Two words are coreference when both refer to a single entity in the text or the real world. So the main task of coreference resolution systems is to identify terms that refer to a unique entity. A coreference resolution tool could be...
متن کاملCorefrence resolution with deep learning in the Persian Labnguage
Coreference resolution is an advanced issue in natural language processing. Nowadays, due to the extension of social networks, TV channels, news agencies, the Internet, etc. in human life, reading all the contents, analyzing them, and finding a relation between them require time and cost. In the present era, text analysis is performed using various natural language processing techniques, one ...
متن کاملPolish Coreference Corpus
The Polish Coreference Corpus (PCC) is a large corpus of Polish general nominal coreference built upon the National Corpus of Polish. With its 1900 documents from 14 text genres, containing about 540,000 tokens, 180,000 mentions and 128,000 coreference clusters, the PCC is among the largest coreference corpora in the international community. It has some novel features, such as the annotation of...
متن کاملExtending the Entity-grid Coherence Model to Semantically Related Entities
This paper reports on work in progress on extending the entity-based approach on measuring coherence (Barzilay & Lapata, 2005; Lapata & Barzilay, 2005) from coreference to semantic relatedness. We use a corpus of manually annotated German newspaper text (TüBa-D/Z) and aim at improving the performance by grouping related entities with the WikiRelate! API (Strube & Ponzetto, 2006).
متن کاملDetection of Nested Mentions for Coreference Resolution in Polish
This paper describes the results of creating a shallow grammar of Polish capable of detecting multi-level nested nominal phrases, intended to be used as mentions in coreference resolution tasks. The work is based on existing grammar developed for the National Corpus of Polish and evaluated on manually annotated Polish Coreference Corpus.
متن کامل